14 research outputs found

    Using POS n-grams to detect grammatical errors in Finnish text

    Automatic grammar checking is a useful tool for people who write texts for publication, and grammar checkers also benefit language learners. The most widely used checkers for Finnish are rule-based, which means they cover only a small fraction of grammatical errors, and extending the rule set requires a great deal of manual work. Statistical methods can detect a larger variety of errors without hand-crafted rules. One easily implemented statistical approach is to collect an example set of grammatical n-grams and check whether all the n-grams of the sentence under inspection occur in that example set. Finnish has a rich inflectional morphology, and new words can also be created through derivation. If word tokens are used as the units of the n-grams, the example set would have to be impossibly large to cover Finnish grammar comprehensively. This master's thesis presents a grammar-checking method that is easy to implement because it uses n-grams in the way described above, but with part-of-speech (POS) information instead of word tokens as the n-gram units, which makes it feasible to collect the example n-grams and keeps their number small enough to handle. The n-grams and their occurrence counts are collected from FinnTreeBank, a morphologically annotated corpus of Finnish. The grammar checker is evaluated in 200 experimental setups that differ from one another in five ways. Half of the checkers are trained on a small hand-annotated corpus and half on a large automatically annotated corpus. Half of the checkers use sentence-boundary markers in their n-grams and half do not. In half of the setups a single interpretation of the sentence structure is selected for checking, and in half all possible structural interpretations are checked. Each checker also uses one of five occurrence-count thresholds that an n-gram must exceed to be accepted as grammatical.
    In addition, each checker uses one of five POS n-gram types, each containing a different combination of POS information. The grammar checker is evaluated on grammatically incorrect sentences produced by a machine translation system together with their grammatically correct counterparts. In most of the setups the checker either flags only a few errors and is often wrong, or flags nearly all sentences, including the grammatical ones, as erroneous. The setup with the best precision uses the large corpus, no sentence-boundary markers, the method that checks all structural interpretations, a low occurrence-count threshold, and the POS information with the fewest possible surface forms. In this setup the checker is right about 86% of the time when it flags grammatical errors, but it finds only about 27% of the errors in the test data. As such, the implemented method is therefore not usable for checking Finnish grammar, but it could be improved by adding a disambiguation component and by using a larger training corpus.
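The core idea of the thesis can be sketched in a few lines: count POS n-grams in a reference corpus, then flag any n-gram of a new sentence whose count falls below a threshold. The function names, the tiny toy corpus, and the tag inventory below are hypothetical stand-ins; the actual system uses FinnTreeBank annotations and five different POS n-gram types.

```python
from collections import Counter

def collect_pos_ngrams(tagged_corpus, n=3):
    """Count POS n-grams over a corpus of POS tag sequences."""
    counts = Counter()
    for tags in tagged_corpus:
        for i in range(len(tags) - n + 1):
            counts[tuple(tags[i:i + n])] += 1
    return counts

def flag_errors(tags, counts, n=3, threshold=1):
    """Return the n-grams of a sentence whose corpus count is below
    the threshold; these are flagged as likely grammatical errors."""
    flagged = []
    for i in range(len(tags) - n + 1):
        gram = tuple(tags[i:i + n])
        if counts[gram] < threshold:
            flagged.append(gram)
    return flagged

# Toy "corpus": POS tag sequences standing in for FinnTreeBank data.
corpus = [
    ["N", "V", "N"],
    ["N", "V", "A", "N"],
    ["A", "N", "V", "N"],
]
counts = collect_pos_ngrams(corpus, n=2)
# The bigram ("V", "V") never occurs in the corpus, so it is flagged.
print(flag_errors(["N", "V", "V", "N"], counts, n=2, threshold=1))
```

Raising the threshold makes the checker stricter (more flags, lower precision), which is exactly the trade-off the 200 experimental setups explore.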

    OpusFilter : A Configurable Parallel Corpus Filtering Toolbox

    This paper introduces OpusFilter, a flexible and modular toolbox for filtering parallel corpora. It implements a number of components based on heuristic filters, language identification libraries, character-based language models, and word alignment tools, and it can easily be extended with custom filters. Bitext segments can be ranked according to their quality or domain match using single features or a logistic regression model that can be trained without manually labeled training data. We demonstrate the effectiveness of OpusFilter on the example of a Finnish-English news translation task based on noisy web-crawled training data. Applying our tool leads to improved translation quality while significantly reducing the size of the training data, also clearly outperforming an alternative ranking given in the crawled data set. Furthermore, we show the ability of OpusFilter to perform data selection for domain adaptation.
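The heuristic-filter idea can be illustrated with a minimal, self-contained sketch. This is not OpusFilter's actual API (which is configuration-driven); the filter names and thresholds below are hypothetical examples of the kind of checks such a pipeline applies to each bitext segment pair.

```python
def length_ratio_ok(src, tgt, max_ratio=2.0):
    """Reject pairs whose token-length ratio is implausibly large."""
    ls, lt = len(src.split()), len(tgt.split())
    if min(ls, lt) == 0:
        return False
    return max(ls, lt) / min(ls, lt) <= max_ratio

def alpha_ratio_ok(text, min_ratio=0.5):
    """Reject segments dominated by non-alphabetic characters."""
    if not text:
        return False
    alpha = sum(ch.isalpha() for ch in text)
    return alpha / len(text) >= min_ratio

def keep_pair(src, tgt):
    """A segment pair survives only if it passes every filter."""
    return (length_ratio_ok(src, tgt)
            and alpha_ratio_ok(src)
            and alpha_ratio_ok(tgt))

pairs = [
    ("Tämä on testi.", "This is a test."),      # plausible pair
    ("Hei!", "This sentence is far too long to be a translation of that."),
    ("12345 67890", "12345 67890"),             # no alphabetic content
]
kept = [p for p in pairs if keep_pair(*p)]
print(kept)  # only the first pair survives
```

In OpusFilter itself, such filter scores can also be combined as features of a logistic regression ranker rather than applied as hard accept/reject rules.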

    The OPUS Resource Repository : An Open Package for Creating Parallel Corpora and Machine Translation Services

    This paper presents a flexible and powerful system for creating parallel corpora and for running neural machine translation services. Our package provides a scalable data repository backend that offers transparent data pre-processing pipelines and automatic alignment procedures that facilitate the compilation of extensive parallel data sets from a variety of sources. Moreover, we develop a web-based interface that constitutes an intuitive frontend for end-users of the platform. The whole system can easily be distributed over virtual machines and implements a sophisticated permission system with secure connections and a flexible database for storing arbitrary metadata. Furthermore, we also provide an interface for neural machine translation that can run as a service on virtual machines, which also incorporates a connection to the data repository software.

    OpusTools and Parallel Corpus Diagnostics

    The 12th Language Resources and Evaluation Conference (LREC 2020) was cancelled due to the COVID-19 pandemic. This paper introduces OpusTools, a package for downloading and processing parallel corpora included in the OPUS corpus collection. The package implements tools for accessing compressed data in their archived release format and makes it possible to easily convert between common formats. OpusTools also includes tools for language identification and data filtering, as well as tools for importing data from various sources into the OPUS format. We show the use of these tools in parallel corpus creation and data diagnostics. The latter is especially useful for identifying potential problems and errors in the extensive data set. Using these tools, we can now monitor the validity of data sets and improve the overall quality and consistency of the data collection.

    Open Translation Models, Tools and Services

    The ambition of the Open Translation Models, Tools and Services (OPUS-MT) project is to develop state-of-the-art neural machine translation (NMT) models that can be freely distributed and applied in research as well as in professional applications. The goal is to pre-train translation models at a large scale on openly available parallel data and to create a catalogue of such resources for streamlined integration and deployment. For the latter, we also implement and improve web services and computer-assisted translation (CAT) tools that can be used in online interfaces and professional workflows. Furthermore, we want to enable the re-use of models to avoid repeating costly training procedures from scratch and thereby contribute to reducing the carbon footprint of MT research and development. The ELG pilot project focused on European minority languages, improved translation quality in low-resource settings, and the integration of MT services into the ELG infrastructure.

    The University of Helsinki Submission to the IWSLT2020 Offline Speech Translation Task

    This paper describes the University of Helsinki Language Technology group’s participation in the IWSLT 2020 offline speech translation task, addressing the translation of English audio into German text. In line with this year’s task objective, we train both cascade and end-to-end systems for spoken language translation. We opt for an end-to-end multitasking architecture with shared internal representations and a cascade approach that follows a standard procedure consisting of ASR, correction, and MT stages. We also describe the experiments that served as a basis for the submitted systems. Our experiments reveal that multitasking training with shared internal representations is not only possible but allows for knowledge-transfer across modalities.

    Paraphrase Detection on Noisy Subtitles in Six Languages

    We perform automatic paraphrase detection on subtitle data from the Opusparcus corpus comprising six European languages: German, English, Finnish, French, Russian, and Swedish. We train two types of supervised sentence embedding models: a word-averaging (WA) model and a gated recurrent averaging network (GRAN) model. We find that GRAN outperforms WA and is more robust to noisy training data. Better results are obtained with more and noisier data than with less and cleaner data. Additionally, we experiment on other datasets, without reaching the same level of performance, because of domain mismatch between training and test data.
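The word-averaging (WA) model named above can be sketched as follows: embed each sentence as the mean of its word vectors and compare sentences by cosine similarity. The toy vectors, the vocabulary, and the decision threshold are hypothetical; the paper's models learn these embeddings in a supervised way.

```python
import math

# Toy 2-d word vectors; in the paper these are learned from data.
VECS = {
    "hello": [1.0, 0.0], "hi": [0.9, 0.1],
    "there": [0.0, 1.0], "friend": [0.1, 0.9],
}
DIM = 2

def embed(sentence):
    """Word-averaging (WA) sentence embedding: mean of word vectors."""
    words = [w for w in sentence.lower().split() if w in VECS]
    if not words:
        return [0.0] * DIM
    return [sum(VECS[w][d] for w in words) / len(words) for d in range(DIM)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def is_paraphrase(s1, s2, threshold=0.95):
    """Classify a sentence pair by thresholding embedding similarity."""
    return cosine(embed(s1), embed(s2)) >= threshold

print(is_paraphrase("hello there", "hi friend"))  # → True
```

Because the WA embedding ignores word order, it is cheap and surprisingly robust, which is one reason it serves as the baseline against the GRAN model.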

    Annotation of subtitle paraphrases using a new web tool

    This paper analyzes the manual annotation effort carried out to produce Opusparcus, the Open Subtitles Paraphrase Corpus for six European languages. Within the scope of the project, a new web-based annotation tool was created. We discuss the design choices behind the tool as well as the setup of the annotation task. We also evaluate the annotations obtained. Two independent annotators needed to decide to what extent two sentences approximately meant the same thing. The sentences originate from subtitles of movies and TV shows, which constitutes an interesting genre of mostly colloquial language. Inter-annotator agreement was found to be on par with a well-known previous paraphrase resource from the news domain, the Microsoft Research Paraphrase Corpus (MSRPC). Our annotation tool is open source. The tool can be used for closed projects with restricted access and controlled user authentication as well as for open crowdsourced projects, in which anyone can participate and user identification takes place based on IP addresses.

    The FISKMÖ Project : Resources and Tools for Finnish-Swedish Machine Translation and Cross-Linguistic Research

    This paper presents FISKMÖ, a project that focuses on the development of resources and tools for cross-linguistic research and machine translation between Finnish and Swedish. The goal of the project is the compilation of a massive parallel corpus out of translated material collected from web sources, public and private organisations, and language service providers in Finland with its two official languages. The project also aims at the development of open and freely accessible translation services between these two languages, both for general and for domain-specific use. We have released new data sets with over 3 million translation units, a benchmark test set for MT development, pre-trained neural MT models with high coverage and competitive performance, and a self-contained MT plugin for a popular CAT tool. The latter enables offline translation without dependencies on external services, making it possible to work with highly sensitive data without compromising security.

    Democratizing Neural Machine Translation with OPUS-MT

    This paper presents the OPUS ecosystem with a focus on the development of open machine translation models and tools, and their integration into end-user applications, development platforms and professional workflows. We discuss our ongoing mission of increasing language coverage and translation quality, and describe current work on the development of modular translation models and speed-optimized compact solutions for real-time translation on regular desktops and small devices.